PhosVarDeep: deep-learning based prediction of phospho-variants using sequence information

Human DNA sequencing has revealed numerous single nucleotide variants associated with complex diseases. Researchers have shown that these variants have potential effects on protein function, one of which is to disrupt protein phosphorylation. Based on conventional machine learning algorithms, several computational methods for predicting phospho-variants have been developed, but their performance still leaves considerable room for improvement. In recent years, deep learning has been successfully applied in biological sequence analysis with its efficient sequence pattern learning ability, which provides a powerful tool for improving phospho-variant prediction based on protein sequence information. In the study, we present PhosVarDeep, a novel unified deep-learning framework for phospho-variant prediction. PhosVarDeep takes reference and variant sequences as inputs and adopts a Siamese-like CNN architecture containing two identical subnetworks and a prediction module. In each subnetwork, general phosphorylation sequence features are extracted by a pre-trained sequence feature encoding network and then fed into a CNN module for capturing variant-aware phosphorylation sequence features. After that, a prediction module is introduced to integrate the outputs of the two subnetworks and generate the prediction results of phospho-variants. Comprehensive experimental results on phospho-variant data demonstrates that our method significantly improves the prediction performance of phospho-variants and compares favorably with existing conventional machine learning methods.


INTRODUCTION
Nowadays, human DNA sequencing studies have revealed millions of single nucleotide variants, which have been shown to significantly associate with complex diseases such as cancer and cardiovascular diseases (Gonzalez-Perez et al., 2013;MacArthur et al., 2014). Although millions of variants have been discovered, their exact effects on resultant RNA or protein products generally remain unknown (Patrick et al., 2017). One of the potential effects of these variants on protein function is to disrupt post-translational modifications especially protein phosphorylation (Kim et al., 2015;Krassowski et al., 2018), since phosphorylation is the most ubiquitous post-translational modification and plays an important role in understanding the alteration of signaling pathways caused by variations (Reimand & Bader, 2013;Reimand, Wagih & Bader, 2013). Therefore, identifying and understanding variants that affect phosphorylation status is critical in the study of cell biology, disease treatment and prevention. Here, we follow the previous studies (Ryu et al., 2009;Patrick et al., 2017) to use the term 'phospho-variant' to refer to a variant that impacts the phosphorylation status of an amino acid. The examples of phospho-variants used in this paper include those variants that modify the S/T /Y residue or adjacent residue, i.e., Type I, Type II and Type III defined in Ryu et al. (2009).
Indeed, it has been reported that there are numerous phospho-variants that are capable of impacting protein phosphorylation. For example, over 19,000 missense mutations are included in the PhosphoSitePlus PTMVar dataset (Hornbeck et al., 2015), which fall within a 15-residue window centered by an experimentally-identified phosphorylation site and may preeminently disrupt existing phosphorylation sites or introduce new phosphorylation sites. Meanwhile, several databases have also been developed to catalogue the suspected effects of variants on potential phosphorylation sites. For example, Ryu et al. (2009) search for known phospho-variants and predict other possible phospho-variants among human variations by PredPhospho, which are then incorporated into the Phospho-variant database. Subsequently, by string matching with 23,978 phosphorylation sites on human, Ren et al. (2010) detect potential phospho-variants and compile them into the PhosSNP database.
In contrast to the above approaches providing a database, there are several computational methods based on conventional machine learning algorithms for detecting and analyzing phospho-variants. For example, Wagih, Reimand & Bader (2015) developed a Bayesian statistics-based method called mutation impact on phosphorylation (MIMP), which constructs position weight matrices and trains Gaussian mixture models to predict the function of variants on phosphorylation sites. Subsequently, established on the previous Bayesian network model for phosphorylation site prediction, Patrick et al. (2017) presented an effective method called PhosphoPICK-SNP to quantify the expected impacts of variants on protein phosphorylation status. The PhosphoPICK-SNP method obtains prediction scores from a pair of reference and variant protein sequences surrounding a potential phosphorylation site containing missense mutation, and then combines them to analyze the impacts of variation on protein phosphorylation. In this way, Patrick et al. (2017) predict the effects of known phospho-variants on phosphorylation and construct a background distribution of proteome-wide predicted variant effects to detect novel examples of phospho-variants.
Recently, as a rising and promising machine learning technique, deep learning has made a remarkable breakthrough in many areas such as image recognition (Rawat & Wang, 2017) and natural language understanding (Collobert et al., 2011). Compared with conventional machine learning techniques, deep learning methods have a distinctive advantage that allows automatically discovery of the complex representations needed for downstream task. Among them, convolutional neural network (CNN) (Krizhevsky, Sutskever & Hinton, 2017) has been successfully undertaken in biological sequence analysis for its powerful capability of learning sequence patterns. For example, Alipanahi et al. (2015) employed CNN in DeepBind to predict sequence specificities of DNA-and RNA-binding proteins.
In addition, for phosphorylation site prediction, Wang et al. (2017) proposed Musitedeep based on a multi-layer CNN architecture with attention mechanism, and lately Luo et al. (2019) presented DeepPhos using densely connected CNN architectures to learn multiple representations of protein sequences. These deep learning methods using CNN architectures have obtained better performance than conventional machine learning methods. However, so far there is no approach to address the problem of phospho-variant prediction by deep learning and it is nontrivial to develop such a powerful tool that can effectively utilize both reference and variant sequence information to predict impacts of variants on protein phosphorylation status.
In this work, we propose PhosVarDeep, a novel unified deep-learning framework for phospho-variant prediction by efficiently extracting and combining both reference and variant protein sequential information. PhosVarDeep employs a Siamese-like CNN architecture containing two identical subnetworks with shared weights. Each subnetwork is composed of a sequence feature encoding network (PhosFEN) and a multi-layer CNN (CNN module). To begin, we utilize PhosFEN to capture general phosphorylation sequence features of the reference and variant sequences, which are fed into the CNN module to further learn variant-aware phosphorylation sequence features jointly. Then we employ a prediction module to combine the two obtained features from the subnetworks to produce a prediction score that best separates the positive and negative examples of phospho-variants. We conduct comprehensive experiments to study the performance of PhosVarDeep, and the evaluation results exhibit that our proposed method significantly improves the predictive performance in identifying phospho-variants and is superior to existing prediction methods.

Dataset and pre-process
Here, in order to train and evaluate our method, we collect over 2,000 potential phosphovariants in PhosSNP (Ren et al., 2010) as the positive set and adopt three negative sets generated by Patrick et al. (2017), which contain information of protein UniProt index, amino acid variation and phosphorylation sites. The positive dataset in PhosSNP we adopt is based on exact string matching reference/variant sequence with experimentally identified human phosphorylation sites. Accordingly, either identical hit in reference or variant sequence is considered as potential phospho-variants (Ren et al., 2010). As for negative sets, as described in Ralph et al., the three high confidence negative sets adopted in this study are generated by different criteria under the fact that phosphorylation sites are less likely to occur (1) in solvent inaccessible/buried regions of a protein; (2) in transmembrane domains; (3) in proteins that do not interact either directly or through mediators with a kinase (Patrick et al., 2017). Next, we map the collected phospho-variant data with protein sequences by its corresponding UniProt index and amino acid variation so that each phospho-variant corresponds to a pair of reference and variant protein sequences (Bateman et al., 2015). Table 1 shows the sizes of the positive set and three negative sets used for phospho-variant prediction, with respect to different phosphorylation site types, i.e., serine (S)/threonine (T) or tyrosine (Y). Given a phospho-variant, we intercept the protein fragments centered at a phosphorylation site on its corresponding pair of reference and variant sequences. Then each protein fragment is encoded by the one-hot encoding strategy into a L × N two-dimension matrix, where L represents the window size of the protein fragment, and N is set to 21 according to the total number of common amino acids (Min et al., 2017). Besides, we apply CD-HIT with similarity threshold of 40% to all the collected data to reduce the sequence redundancy by following previous studies (Pan et al., 2014;Zhao et al., 2012).
After completing the data pre-processing, the positive set is combined with each negative set to form three phospho-variant datasets, and for each phospho-variant dataset, we train two deep learning models using phospho-variant data on S/T and Y sites respectively. Meanwhile, we use a performance evaluation strategy commonly adopted in deep learning methods for sequential analysis (Zhou & Troyanskaya, 2015;Khurana et al., 2018), that is, each dataset is randomly divided into strictly non-overlapping training, validation and test sets and the ratio is set to 6:2:2 in our study. In this way, we adopt the training data to tune the model weights and utilize the validation data to prevent overfitting (Zhou & Troyanskaya, 2015). The test set is adopted to evaluate the performance of PhosVarDeep and to implement the comparison of PhosVarDeep and other phospho-variant prediction approaches.

Siamese-like architecture of PhosVarDeep
In this study, we design a Siamese-like deep-learning framework for phospho-variant prediction. As a classic metric learning method, Siamese neural network is first introduced in Bromley et al. (1993) and has been successfully adopted in tasks such as signature verification (Chopra, Hadsell & LeCun, 2005), face verification (Cao, Ying & Li, 2013) and object tracking (Bertinetto et al., 2016). To measure similarity between pairs of inputs, Siamese neural network learns a function mapping input patterns to the representation or target space where similar pairs will get close and non-similar pairs will get away from each other (Zagoruyko & Komodakis, 2015). Generally, Siamese neural network contains two subnetworks sharing same configuration, weights, and parameters, which ensures that two similar inputs are transformed into similar feature representations. Here, we adopt a deep-learning framework in Siamese style to learn a complex non-liner mapping for separating the positive and negative samples of the phospho-variants. Figure 1 shows the proposed architecture of PhosVarDeep, which consists of two identical subnetworks with shared weights and a prediction module. Specifically, within each subnetwork, the PhosFEN and the CNN module extract a high dimensional feature representation of an input sequence. Next, the prediction module is introduced to integrate the outputs of the two subnetworks as combined features to generate prediction results of phospho-variants. The details of those networks are described as follows.

PhosFEN
In each subnetwork, PhosFEN presents a sequence feature encoding network to extract the respective features from the reference and variant local sequences of a given phosphovariant. Specifically, the input of PhosFEN is one-hot local sequence e s ∈ R L×21 (s = s 1 ,s 2 ), with s 1 and s 2 being the reference and variant sequences, respectively. Then the sequence features related to phosphorylation are extracted by PhosFEN as follows: where W F represents all the parameter matrices and bias items in PhosFEN. In this way, the feature map f s is exported from PhosFEN as general phosphorylation-related local features, and the pair of f s 1 and f s 2 is used as input to the following CNN module. In this work, since the amount of training data of phospho-variants is much smaller than the size of parameters in deep learning models, overfitting is likely to occur in the training process. In order to solve the problem of insufficient training data, we have studied the feature learning capabilities of the existing deep-learning models for phosphorylation site prediction and adopt DeepPhos (Luo et al., 2019) in our study through transfer learning (Yosinski et al., 2014). DeepPhos is a pre-trained deep learning model consisting of densely connected CNN blocks, which can capture the complex deep sequence representations relevant to phosphorylation better than other deep-learning models. We transfer the whole layers of the model to PhosFEN except its final three layers including the flatten layer, the fully connected layer, and the output layer in our study.

CNN module
After the general phosphorylation sequence features are extracted from PhosFEN, they are fed into the CNN module to generate variant-aware phosphorylation sequence features. As shown in Fig. 1, each CNN module is comprised of multiple convolutional layers and max pooling layers, and a flatten layer. The convolutional layer generates a feature map by convoluting the input with a set of convolution kernels. Mathematically, for input phosphorylation-related features f s (s = s 1 ,s 2 ), let h i,s be the ith feature map in output, and the convolutional layers can be described as follows: where W k i and b k i refer to the parameter matrices and bias items in the ith convolutional layer, with k being the number of convolutional kernels. α represents ReLU activation function that can realize the nonlinear transformation, M refers to the number of convolutional layers and here is set to 3. In this way, the variant-aware phosphorylation sequence features can be generated, and then by the flatten layer, they are transformed into a pair of one-dimensional tensors (h s 1 ,h s 2 ) ∈ R d . In addition, to relieve the risk of overfitting during the training process, a dropout layer is added after each convolutional layer to discard some neurons randomly (Srivastava et al., 2014). The details of the CNN Module are listed in Table 2.

Prediction module
Taking a pair of sequence features h s 1 and h s 2 as inputs, the prediction module utilizes a multi-layer DNN consisting of three fully connected layers to integrate the feature pair and compute a prediction score for phospho-variant. In detail, the two input features are concatenated and then fed into the multi-layer DNN to capture the abstract combined features c v ∈ R u , here u refers to the number of neurons in the final fully connected layer. Next, the output layer with softmax as the activation function is used to generate prediction scores of the positive and negative phospho-variant, which can be calculated as follows: P y = 0|(s 1 ,s 2 ) = 1 − P y = 1|(s 1 ,s 2 ) where W v ∈ R u×2 represents the weight matrix of softmax function. The details of the multi-layer DNN and output layer are listed in Table 2.

Training
PhosVarDeep is a unified deep learning framework for phospho-variant prediction and is trained to classify the phospho-variants into two classes: positive phospho-variants and negative phospho-variants. Accordingly, for the aim of minimizing training error, here we adopt a binary cross-entropy as loss function: where N refers to the size of training data, (s 1 ,s 2 ) j represents the pair of reference and variant sequences for the j th input phospho-variant and y j refers to the corresponding class label. We freeze all the layers of PhosFEN and train the CNN module and prediction module jointly on phospho-variant training data, in which the weights and biases of conventional layers and fully connected layers are the parameters to be estimated. Besides, as a widely used stochastic gradient descent algorithm, Adam optimizer (Kingma & Ba, 2015). is adopted in the training process. At the same time, we use mini-batch training strategy in this study to randomly divide small proportions of the training samples in each iteration into optimizer loops. In addition, to deal with the problem of data imbalance, we follow the previous study (Wang et al., 2017) to apply a bootstrapping strategy in our deep learning method.

Performance assessment
In this study, several widely used measurements are leveraged to assess the performance of our proposed PhosVarDeep, which include specificity (Sp), sensitivity (Sn), precision (Pre), overall accuracy (Acc), F1 scores and Matthew's correlation coefficient (MCC). Their detailed definitions are as below: Acc = TP + TN TP + TN + FP + FN (10) where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively. The other measurements are calculated based on TP, FP, TN, and FN. Moreover, we plot the receiver operating characteristic curve (ROC) and calculate the area under ROC curve (AUC) to evaluate the overall performance.

Determining the PhosFEN model
To begin, in order to extract general phosphorylation sequence features, we transfer phosphorylation prediction models to PhosFEN in our method. There are several pre-trained models with good phosphorylation-related sequence feature learning capabilities. Among them, Musitedeep (Wang et al., 2017) and DeepPhos (Luo et al., 2019) are both CNN models achieving significantly better performance than previous methods. To determine which model provides more useful information for phosphovariant prediction, we compare the two phosphorylation site predictors using an existing comprehensive phosphorylation site dataset (Luo et al., 2019) collected from Phospho.ELM, PhosphositePlus, HPRD, dbPTM and SysPTM. The Roc curves and AUC values of the two models are shown in Fig. 2. It is obvious that DeepPhos consistently achieved higher performance on both S/T sites and Y sites than Musitedeep. For S/T sites, the AUC value of DeepPhos is 3.0% higher than that of Musitedeep. For Y sites, compared with Musitedeep, the AUC value obtained by DeepPhos is increased by 5.7%. The above results suggest that DeepPhos has better phosphorylation-related feature learning capability on phospho-variant data. Accordingly, for accurate phospho-variant prediction, we employ pre-trained DeepPhos in PhosFEN to capture general phosphorylation sequence features in phospho-variant data.

Performance evaluation of PhosVarDeep
In this part, to evaluate the performance of PhosVarDeep in extracting and integrating reference and variant sequence information for phospho-variant prediction, we conduct an ablation study by comparing three different model configurations: (1) PhosFEN*: in this case, the paired outputs of PhosFEN are directly combined and fed into a fully connected layer to get the prediction results; (2) PM*: in this case, we employ the prediction module to integrate the outputs of PhosFEN to predict phospho-variant; (3) PhosVarDeep.: in this case, for the paired outputs of PhosFEN, we utilize the CNN module to further learn variant-aware phosphorylation sequence features and employ the prediction module to generate combined features for final prediction. We apply all above methods on three phospho-variant test sets and Table 3 lists their AUC values on S/T and Y sites. It is obvious that PM* produces higher performance  In addition to AUC values, we also utilize Sp, Sn, Acc, Pre, F1 and MCC to verify the effectiveness of the proposed method. Similar to previous studies (Luo et al., 2019;Wang et al., 2021), we calculate the values of other measurements when Sp is equal to high

Comparison with existing methods
To further assess the performance of PhosVarDeep for phospho-variant prediction, we compare it against MIMP (Wagih, Reimand & Bader, 2015), a sequence-based method and PhosphoPICK-SNP (Patrick et al., 2017), a motif analysis and context-based method.
It is of note that we follow (Patrick et al., 2017) to perform the comparison using those phospho-variants for which we can obtain prediction results from MIMP. As shown in Table 6, PhosVarDeep obtains the best performance on AUC values in comparison with other methods. Take test set2 as an example, compared with MIMP and PhosphoPICK-SNP, on S/T sites PhosVarDeep obtains 12.6% and 10.6% improvement for AUC values, respectively. Also, on Y sites, the AUC value achieved by PhosVarDeep is 0.922, which has an improvement of 9.5% over the next-best method. Furthermore, we compute the values of Sn, Acc, Pre, MCC and F1 for all the methods on the test sets on S/T sites, and the results are shown in Fig. 3. It can be observed that PhosVarDeep consistently performs better on all the metrics than MIMP and PhosphoPICK-SNP.  Table 7. The closer the prediction score is to 1.0, the more likely it is to change the phosphorylation state. From Table 7, we can see that PhosVarDeep remarkably outperforms MIMP for predicting the experimentally confirmed positive examples of phospho-variants. To sum up, these results indicate that PhosVarDeep is a highly competitive and efficient method in predicting phospho-variant based on sequence information.

Visualization of learned features
The combined features extracted by PhosVarDeep and the original combined one-hot encoding features of reference and variant sequences are visualized in this section to intuitively shows the ability of our proposed deep learning method in phospho-variant prediction. Here, we use a popular visualization algorithm t-SNE (Van Der Maaten & Hinton, 2008) and observe the difference between positive and negative examples of phospho-variants. Take test set3 as an example, the original combined sequence features and the abstract combined features extracted by our model are shown in Fig. 4. It suggests that we can hardly separate the positive examples of phospho-variant from negatives by original combined features, while as the extracted combined features show separate trends, we can distinguish these two classes more clearly by the deep representations of PhosVarDeep. These results demonstrate that original reference and variant sequences can be combined and transformed into meaningful representations with stronger discriminant power by PhosVarDeep, which can be helpful for further analysis of phospho-variant prediction.

DISCUSSION
In this study, we present PhosVarDeep, a unified deep-learning framework based on sequence information for accurate phospho-variant prediction. The experimental results demonstrate that PhosVarDeep obtains better performance than existing phospho-variant prediction methods evaluated by three test sets. In addition to performance metrics, we also generate the visualization results by t-SNE, which show PhosVarDeep can transform protein sequences to meaningful representations with strong discriminant power for phospho-variant prediction. The key contributions of our work are summarized as follows: (1) we exploit a Siamese-like deep neural network architecture with two identical subnetworks and a prediction module for phospho-variant prediction, which allows us to learn deep features on a pair of reference and variant protein sequences of each phosphovariant jointly, (2) we employ a deep neural feature encoding network PhosFEN in each subnetwork to capture sequence features related to phosphorylation by transfer learning, (3) we design the CNN module with two parallel multi-layer CNNs to learn variantaware phosphorylation sequence features that are useful for phospho-variant prediction, (4) by effectively integrating the outputs of the above architecture with the prediction module, our method achieves remarkable performance for classifying phospho-variants on comprehensive experiments. Although PhosVarDeep has shown promising performance in predicting phosphovariants, there is still considerable room for further improvement. Firstly, some cellular context information (e.g., protein-protein interactions) is also useful for predicting phospho-variants (Patrick et al., 2017), which can be further integrated with sequence information in our future work. Secondly, as the deep learning method is still a black-box lacking interpretability (Ma et al., 2018), our method faces great challenges in explaining meaningful biological processes. In the future work, we will modify the framework to make our model more interpretable and realizable by combing some other modules, such as attention mechanisms (Mnih et al., 2014). Thirdly, the unsupervised approach to distinguishing potential phospho-variants is a promising alternative to solve the problem about the absence of a suitably large positive training set, which can be further used in our future study. Finally, as an effective computational approach for exacting and combining reference and variant sequence information, PhosVarDeep can be further advanced and extended to other types of variant prediction tasks.